
This tutorial serves as the final project for CMSC641 at University of Maryland.
AllRecipes is a website that allows users to share their recipes and discover new recipes that other users have shared. Users can indicate that they attempted a recipe, rate it, and leave comments.
The objective of this tutorial is to discover how ingredients contribute to the success of a recipe. Along the way, we will examine other factors, such as cook time and preparation steps, and evaluate their contribution to a recipe's rating. We will scrape data directly from the AllRecipes site, build a SQLite database to store the metadata for each recipe, clean the data, and draw out insights. The majority of this tutorial will focus on data acquisition, storage, and processing.
To follow this tutorial you will need a Python 3 installation; we suggest installing via Anaconda.
You will also need the following Python packages, all of which are used below: Requests, BeautifulSoup (bs4), Pandas, NumPy, Seaborn, Matplotlib, scikit-learn, and NetworkX.
We will also make use of the NYTimes Ingredient-Phrase-Tagger to clean the freeform ingredients section of each recipe. At the time of writing, this project isn't fully Python 3 compatible due to print statements missing parentheses. I've submitted a pull request to correct this. There are also Pandas warnings that need to be addressed, but they do not currently break the code.
Before progressing, you will need to clone the repo into the same directory as this project and make sure that you have CRF++ installed. Please follow the directions they provide.
AllRecipes no longer provides free access to their Recipes API. Therefore, we will need to scrape data directly from their site using Requests and BeautifulSoup. We will need to crawl every recipe in the AllRecipes domain and parse the HTML to find elements that contain data of interest. A quick check of the site's robots.txt file shows us that this usage is allowable, but that we will need to limit requests to 1 per second.
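As a quick sketch, that robots.txt check can be automated with the standard library's urllib.robotparser. The rules below are illustrative only, not a copy of the live file; in practice you would point set_url at the site's robots.txt and call read().

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only -- not the site's actual robots.txt.
sample_rules = """\
User-agent: *
Crawl-delay: 1
Disallow: /cook/
""".splitlines()

rp = RobotFileParser()
rp.parse(sample_rules)

print(rp.can_fetch('*', '/recipes/?sort=Title&page=1'))  # True
print(rp.can_fetch('*', '/cook/12345/'))                 # False
print(rp.crawl_delay('*'))                               # 1
```

The crawl_delay value is what drives the time.sleep(1) calls in the scraping code below.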
To start with, we will need to compile a list of URLs for each recipe that we want to capture. The homepage features an infinite scroll of recipe cards, so we can paginate through each set of cards and record the recipe URLs. Using a manual binary search over page numbers, I found the "last page" of recipe results and recorded its number so I knew when to stop paging through results.
import requests
import time
import csv
from bs4 import BeautifulSoup

f = open('allrecipe_urls.csv', 'a+')
out = csv.writer(f)
base_url = "http://allrecipes.com/recipes/?sort=Title&page="

for page in range(1, 3240):
    time.sleep(1)
    url = base_url + str(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    htmlUrls = soup.find_all('article', {'class': "fixed-recipe-card"})
    processed = [[i.find('a').get('href')] for i in htmlUrls]
    out.writerows(processed)

f.close()
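The manual last-page search mentioned above can be sketched as a standard binary search. has_recipes is a hypothetical predicate that would request a page and report whether any recipe cards were found; here we simulate it so the search logic runs offline.

```python
def find_last_page(has_recipes, lo=1, hi=100000):
    """Binary-search for the largest page number where has_recipes(page) is True."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_recipes(mid):
            lo = mid       # recipe cards found: the last page is mid or later
        else:
            hi = mid - 1   # empty page: the last page is before mid
    return lo

# A real predicate would fetch the page and look for "fixed-recipe-card" elements;
# here we simulate a site whose last page of results is 3239.
print(find_last_page(lambda page: page <= 3239))  # 3239
```

Each probe is one HTTP request, so finding the last page takes only about 17 requests instead of thousands.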
As of November 27, 2018, there were 64,583 active recipes, which suggests that scraping them all at one request per second will take approximately 18 hours. However, I found that this process took closer to 5 days due to connection loss and other unforeseen issues.
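The 18-hour estimate falls straight out of the one-request-per-second limit:

```python
num_recipes = 64583          # active recipes as of November 27, 2018
seconds_per_request = 1      # crawl delay required by robots.txt

hours = num_recipes * seconds_per_request / 3600
print(round(hours, 1))  # 17.9
```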
Because the data is expensive to obtain from a time perspective, it is important that it is written to disk and preserved for later use. For this implementation, we've chosen a local SQLite database due to the variable-length nature of the data: recipes can have 0-n ingredients and 0-m steps. If we used a flat text file, we would have to commit to a maximum number of ingredients/steps ahead of time, before we even visited any URLs.
The AllRecipes database will consist of four tables (recipes, directions, nutrition, and ingredients), corresponding to the four types of data we will be harvesting. The recipes table will include metadata about the recipe such as the author, number of reviews, description, and prep time. The directions table will have a row for each preparation step of each recipe. Similarly, the ingredients table will have a row for each ingredient in each recipe. Finally, the nutrition table will have a row for each recipe that has nutrition facts listed and will contain data such as the calorie count and fat content per serving.
We will now initialize the database and define the tables listed above. First, we will define some helper functions that will format and create the tables and functions that we will use later to insert rows into each table. In the second cell, we will call the create_db() function to build the database.
import sqlite3

def create_conn(db):
    try:
        conn = sqlite3.connect(db)
        return conn
    except Exception as e:
        print(e)
        return None

def create_table(conn, table_sql):
    try:
        cur = conn.cursor()
        cur.execute(table_sql)
    except Exception as e:
        print(e)

def create_db(db):
    sql_recipe = """CREATE TABLE IF NOT EXISTS recipes (
        id integer PRIMARY KEY,
        url text NOT NULL,
        title text,
        author text,
        description text,
        num_photos integer,
        prep_time text,
        cook_time text,
        total_time text,
        rating real,
        reviews integer,
        made_it integer,
        servings integer,
        num_steps integer,
        num_ingredients integer
    );"""
    sql_directions = """CREATE TABLE IF NOT EXISTS directions (
        id integer PRIMARY KEY,
        recipe_id integer NOT NULL,
        step text NOT NULL,
        step_order integer NOT NULL,
        FOREIGN KEY (recipe_id) REFERENCES recipes(id)
    );"""
    sql_nutrition = """CREATE TABLE IF NOT EXISTS nutrition (
        id integer PRIMARY KEY,
        recipe_id integer NOT NULL,
        calories real,
        fat real,
        carbs real,
        protein real,
        cholesterol real,
        sodium real,
        FOREIGN KEY (recipe_id) REFERENCES recipes(id)
    );"""
    sql_ingredients = """CREATE TABLE IF NOT EXISTS ingredients (
        id integer PRIMARY KEY,
        recipe_id integer NOT NULL,
        ingredient text NOT NULL,
        FOREIGN KEY (recipe_id) REFERENCES recipes(id)
    );"""
    conn = create_conn(db)
    if conn is not None:
        create_table(conn, sql_recipe)
        create_table(conn, sql_directions)
        create_table(conn, sql_nutrition)
        create_table(conn, sql_ingredients)
        conn.close()

def update_recipe(cur, summary, recipe_id):
    sql = """UPDATE recipes
             SET title = ?,
                 author = ?,
                 description = ?,
                 num_photos = ?,
                 prep_time = ?,
                 cook_time = ?,
                 total_time = ?,
                 rating = ?,
                 reviews = ?,
                 made_it = ?,
                 servings = ?,
                 num_steps = ?,
                 num_ingredients = ?
             WHERE id = ?"""
    values = summary + (recipe_id,)
    cur.execute(sql, values)

def insert_directions(cur, directions, recipe_id):
    sql = """INSERT INTO directions (recipe_id, step_order, step)
             VALUES (?, ?, ?)"""
    for i in directions:
        values = (recipe_id,) + i
        cur.execute(sql, values)

def insert_ingredients(cur, ingredients, recipe_id):
    sql = """INSERT INTO ingredients (recipe_id, ingredient)
             VALUES (?, ?)"""
    for i in ingredients:
        values = (recipe_id,) + i
        cur.execute(sql, values)

def insert_nutrition(cur, nutrition, recipe_id):
    sql = """INSERT INTO nutrition
             (recipe_id, calories, fat, carbs, protein, cholesterol, sodium)
             VALUES (?, ?, ?, ?, ?, ?, ?)"""
    values = (recipe_id,) + nutrition
    cur.execute(sql, values)
The cell below creates the database.
create_db('allrecipes.db')
Now that we have the database set up, we will insert the URLs that we've already collected.
import sqlite3

f = open('allrecipe_urls.csv', 'r')
reader = csv.reader(f)
conn = create_conn('allrecipes.db')
sql_urls = """INSERT INTO recipes (url) VALUES (?);"""

for url in reader:
    cur = conn.cursor()
    cur.execute(sql_urls, (url[0],))

conn.commit()
conn.close()
f.close()
We have created the database and we have functions to insert data, but we still need code to process the HTML and capture the tags that we are interested in. Once again, we will utilize helper functions to modularize the code.
Each function will take as input a BeautifulSoup HTML object. The function will then use the find or find_all method to locate the HTML tag of interest. Finally, each will return a tuple, because tuples are the data structure that SQLite expects for parameterized queries. In some places we need try/except statements because the HTML tag is not always present; for example, when an author does not include a description for the recipe, there is no description tag in the HTML.
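The tuple convention is easy to see in a minimal, self-contained sqlite3 example (in-memory database and a throwaway table, just for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE demo (id integer PRIMARY KEY, ingredient text)')

# Parameters are bound as a tuple, one element per "?" placeholder.
cur.execute('INSERT INTO demo (ingredient) VALUES (?)', ('mozzarella cheese',))
cur.execute('SELECT ingredient FROM demo')
print(cur.fetchone())  # ('mozzarella cheese',)
conn.close()
```

Note the trailing comma in ('mozzarella cheese',): a one-element tuple, not a parenthesized string.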
def get_ingredients(soup):
    ingred = [i.text.strip() for i in
              soup.find_all('li', {'class': 'checkList__line'})][0:-3]
    return tuple((i,) for i in ingred)

def get_directions(soup):
    steps = [i.text.strip() for i in
             soup.find_all('span', {'class': 'recipe-directions__list--item'})]
    steps = list(filter(None, steps))
    return tuple(zip(range(len(steps)), steps))

def get_nutrition(soup):
    nutrition = soup.find('div', {'class': 'nutrition-summary-facts'})
    calories = nutrition.find('span', {'itemprop': 'calories'}).text.strip(' calories;')
    fat = nutrition.find('span', {'itemprop': 'fatContent'}).text.strip()
    carbs = nutrition.find('span', {'itemprop': 'carbohydrateContent'}).text.strip()
    protein = nutrition.find('span', {'itemprop': 'proteinContent'}).text.strip()
    cholesterol = nutrition.find('span', {'itemprop': 'cholesterolContent'}).text.strip()
    sodium = nutrition.find('span', {'itemprop': 'sodiumContent'}).text.strip()
    return (calories, fat, carbs, protein, cholesterol, sodium)

def get_basics(soup):
    title = soup.find('title').text.split(' - ')[0]
    num_photos = soup.find('span', {'class': 'picture-count-link'}).text.strip(' photos')
    author = soup.find('span', {'class': 'submitter__name'}).text
    summ_stats = soup.find('div', {'class': 'recipe-summary__stars'})
    rating_long = summ_stats.find('div', {'class': 'rating-stars'}).attrs['data-ratingstars']
    num_reviews = summ_stats.find('meta', {'itemprop': 'reviewCount'}).attrs['content']
    made_it = soup.find('span', {'class': 'made-it-count'}).find_next().text.strip('\xa0made it')
    servings = soup.find('meta', {'id': 'metaRecipeServings'}).attrs['content']
    # A missing tag makes find() return None, raising AttributeError on access.
    try:
        description = soup.find('div', {'class': 'submitter__description'}).text.strip()
    except AttributeError:
        description = None
    try:
        prep_time = soup.find('time', {'itemprop': 'prepTime'}).attrs['datetime'].strip('PT')
    except AttributeError:
        prep_time = None
    try:
        cook_time = soup.find('time', {'itemprop': 'cookTime'}).attrs['datetime'].strip('PT')
    except AttributeError:
        cook_time = None
    try:
        total_time = soup.find('time', {'itemprop': 'totalTime'}).attrs['datetime'].strip('PT')
    except AttributeError:
        total_time = None
    return (title, author, description, num_photos, prep_time,
            cook_time, total_time, rating_long, num_reviews,
            made_it, servings)
We will define a function that will execute each of the helper functions and combine their output.
def fetch_data(soup):
    basics = get_basics(soup)
    directions = get_directions(soup)
    ingredients = get_ingredients(soup)
    basics_ext = basics + (len(directions), len(ingredients))
    return (basics_ext, directions, ingredients)
At this point we have code to process the BeautifulSoup and capture the page data that we're interested in. However, we still need to write code to visit each URL, call the processing functions, and then call the database functions to insert the data.
def url_to_soup(url):
    r = requests.get(url)
    return BeautifulSoup(r.text, 'html.parser')

def process_html(recipe_id, conn, soup):
    try:
        summary, directions, ingredients = fetch_data(soup)
        cur = conn.cursor()
        update_recipe(cur, summary, recipe_id)
        insert_directions(cur, directions, recipe_id)
        insert_ingredients(cur, ingredients, recipe_id)
        try:
            nutrition = get_nutrition(soup)
            insert_nutrition(cur, nutrition, recipe_id)
        except Exception as e:
            print('No nutrition elements: ', recipe_id)
            print(e)
        conn.commit()
    except Exception as e:
        conn.rollback()
        print(e, recipe_id)
Now we are finally ready to put it all together and crawl AllRecipes. The following code is written such that it can be paused and restarted without losing data or revisiting recipes that have already been crawled.
First, we query the database for the URLs of all recipes that do not yet have metadata and then iterate through each URL. The query returns URLs in random order, so if we do not have time to collect every recipe, we still end up with a random subset. For each URL we use the url_to_soup function to obtain the HTML and then process_html to locate the elements of interest and insert the data into our database. Once again, we use exception handling to deal with URLs that are no longer active and with faulty connections.
missing_recipes = """SELECT id, url FROM recipes WHERE title is null ORDER BY random()"""
conn = create_conn('allrecipes.db')
cur = conn.cursor()
cur.execute(missing_recipes)

for row in cur:
    recipe_id, url = row
    time.sleep(1)
    try:
        soup = url_to_soup(url)
        process_html(recipe_id, conn, soup)
    except Exception as e:
        print(e, row)
When this code was last executed, we were only able to successfully scrape 64,368 recipes, missing 215. When we manually visited the URLs that we couldn't capture we noted that some resulted in 404 errors because the recipe had been deleted, but the majority were active. However, the styling of the page was very different from the successfully captured URLs. The scraper failed for these pages because the page elements of interest were relocated in the HTML. Below is an example of an outlier recipe.
Instead of writing a second set of code to scrape these outlier pages, we decided to leave them out.
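If you wanted to triage those leftover URLs yourself, one approach is to split them by HTTP status code. The sketch below injects the status lookup as a function so it can run offline; in practice you would pass lambda u: requests.get(u).status_code.

```python
def triage_urls(urls, get_status):
    """Split URLs into dead links (404) and still-active pages the scraper missed."""
    dead, active = [], []
    for url in urls:
        if get_status(url) == 404:
            dead.append(url)
        else:
            active.append(url)
    return dead, active

# Here we fake the status lookup with a dict so the example runs offline;
# the paths are made up for illustration.
statuses = {'/recipe/1/': 404, '/recipe/2/': 200, '/recipe/3/': 200}
dead, active = triage_urls(statuses, statuses.get)
print(len(dead), len(active))  # 1 2
```

The still-active group would be the restyled pages that need a second parser.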

From this point on, we will be using Pandas and Numpy to clean up our data and prepare it for exploratory analysis. We will read the recipes metadata into a dataframe and take a cursory glance to identify areas in need of further processing. Later, we will take a look at ingredients, directions and nutrition.
import pandas as pd
conn = create_conn('allrecipes.db')
# only pull metadata for recipes we successfully crawled
sql = 'SELECT * FROM recipes WHERE title is not null;'
recipes = pd.read_sql(sql=sql, con=conn)
Most of the data that we collected is self-explanatory, but I will note that made_it represents the number of users who self-reported attempting to cook the recipe.
recipes.info()
len(recipes[recipes.made_it.str.contains('k') == True]['made_it'])
recipes.head(1)
Immediately we can see that we will need to correct the three time variables. They are currently stored as strings in an odd format (the number of hours followed by "H", then the number of minutes followed by "M", e.g. "1H30M"), and we will need to convert them to timestamps. We will use a regex to extract the hours and minutes, then create a new column in the dataframe with the combined time.
def fix_time(ser):
    temp = ser.str.extract(r"(?:(\d+)[D][a][y][s]*)?(?:(\d+)[H])?(\d+)[M]?")
    temp.fillna(value=0, inplace=True)
    temp = temp.astype('int32')
    temp['final'] = pd.to_datetime(temp[0]*60*24 + temp[1]*60 + temp[2],
                                   unit='m').dt.strftime('%d:%H:%M')
    return temp['final']

recipes['prep_time'] = fix_time(recipes.prep_time)
recipes['cook_time'] = fix_time(recipes.cook_time)
recipes['total_time'] = fix_time(recipes.total_time)
There is also an issue with the made_it column; it contains both integer and string values. Any recipe that has been made more than 1,000 times does not have an integer value. Instead it is rounded to the nearest 1,000 and expressed in the form "1k". Let's correct this using regex.
recipes.made_it = recipes.made_it.astype('str').str.extract(r"(\d+)[k]*")
recipes.made_it = recipes.made_it.astype('int32')
recipes.head(1)
Now that the data is clean, we can start to explore the relationship between ingredients and ratings. We will continue to utilize Pandas for the heavy lifting, but we will also use Seaborn to visualize relationships.
First we will take a look at the average recipe rating by total number of ingredients.
%matplotlib inline
import seaborn as sb
data = recipes.groupby(by='num_ingredients').mean()['rating']
ax = sb.lineplot(data=data, color="#34495e", legend='full')
_ = ax.set_title('Rating (grey) and recipes (blue) by Number of Ingredients')
_ = ax.set_ylabel('Average Rating')
_ = ax.set_xlabel('Ingredients')
ax2 = ax.twinx()
ax2 = sb.lineplot(data=recipes.groupby(by='num_ingredients').count()['rating'])
_ = ax2.set_ylabel('Count Recipes')
The mean rating is surprisingly constant with respect to the number of ingredients, though it does decrease slightly as the number of ingredients increases. For recipes with >20 ingredients there is much more variability in the average rating.
This is likely because there are very few recipes with this many ingredients. To test this, we can recreate this graph using the variance of rating instead of the average.
data = recipes.groupby(by='num_ingredients').var()['rating']
ax = sb.lineplot(data=data, color="#34495e", legend='full')
_ = ax.set_title('Rating (grey) and recipes (blue) by Number of Ingredients')
_ = ax.set_ylabel('Rating Variance')
_ = ax.set_xlabel('Ingredients')
ax2 = ax.twinx()
ax2 = sb.lineplot(data=recipes.groupby(by='num_ingredients').count()['rating'])
_ = ax2.set_ylabel('Count Recipes')
Variance shows a positive correlation with the number of ingredients. This trend is present even for recipes with <20 ingredients. Even though the average rating does not decrease by much, users are more polarized in the ratings they provide.
Finally, let's check whether there's a relationship between the number of ingredients and the number of users who self-report that they cooked the recipe.
data = recipes.groupby(by='num_ingredients').mean()['made_it']
ax = sb.lineplot(data=data, color="#34495e", legend='full')
_ = ax.set_title('Made_it (grey) and recipes (blue) by Number of Ingredients')
_ = ax.set_ylabel('Average made_it')
_ = ax.set_xlabel('Ingredients')
ax2 = ax.twinx()
ax2 = sb.lineplot(data=recipes.groupby(by='num_ingredients').count()['rating'])
_ = ax2.set_ylabel('Count Recipes')
The more ingredients a recipe has, the less likely users are to attempt cooking it. This could be due to the complexity of the recipe, or simply that recipes with a high ingredient count are infrequent, so users have a hard time finding them.
To limit the scope of this tutorial, I decided to use a part-of-speech tagger specifically trained on recipes by the NYTimes engineering team. I will briefly run through the usage of the model, but you can visit the GitHub page yourself to learn more.
We will run the ingredient list through the tagger so that we can separate ingredients from measurements or other text. For example, if a recipe calls for "2 cups shredded mozzarella cheese" we would like to identify "2 cups" as a measurement, "shredded" as a verb describing the state of the ingredient and "mozzarella cheese" as the base ingredient. For this tutorial we will focus our analysis only on the base ingredients.
To start, we will follow their directions and train the model on a subset of their data using the default parameters. I will collapse the output since it prints quite a lot to console.
import os
os.chdir('ingredient-phrase-tagger/')
! ./roundtrip.sh
Now that the model is trained, we can use it on our dataset. We will once again use the tools provided. We will need to fetch all of the recipe ingredients and write them to a text file to use as input.
sql = 'SELECT * FROM ingredients;'
ingred = pd.read_sql(sql=sql, con=conn)
ingred.ingredient.to_csv("ingredients_raw.csv", index=False)
! python bin/parse-ingredients.py ingredients_raw.csv > results.txt
! python bin/convert-to-json.py results.txt > results.json
We will read the JSON file into pandas and then perform a merge to combine it with the ingredients data we already have. Here we are merging on indexes because we know that the data has not been reordered.
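Merging on indexes behaves like this on a pair of toy frames (the column values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'recipe_id': [7, 7], 'ingredient': ['2 cups flour', '1 egg']})
right = pd.DataFrame({'name': ['flour', 'egg']})

# left_index/right_index match rows by position (index), not by a key column.
combined = left.merge(right, how='left', left_index=True, right_index=True)
print(combined['name'].tolist())  # ['flour', 'egg']
```

This only works because neither dataset was reordered between the export and the tagger run.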
ingred_nyt = pd.read_json('results.json')
ingred_total = ingred.merge(ingred_nyt, how='left', left_index=True, right_index=True)
Because we're only interested in the basic ingredient, we will reduce the columns in the dataset to the basic ingredient name and recipe_id. We'll need to drop duplicates because the same basic ingredient can appear multiple times in the same recipe. We don't have to look farther than the first 5 ingredients to see why...
HINT: strawberries
ingred_total.head()
ingred_dedup = ingred_total[['recipe_id','name']].drop_duplicates()
ingred_dedup.dropna(inplace=True)
To get started, we will add a new column to the dataframe to summarize the frequency of each ingredient in the entire dataset.
ingred_dedup['freq'] = ingred_dedup.groupby('name')['name'].transform('count')
ingred_dedup.freq.describe()
It's refreshing to see that there are many ingredients with a high frequency in our dataset. This bodes well for our future analysis. However, there are some ingredients that only appear once. It looks like these are a combination of obscure or overly specific ingredients as well as mistakes by the tagging model.
ingred_dedup.loc[ingred_dedup.freq == 1].head(10)
len(ingred_dedup.name.value_counts())
There are 18,394 unique ingredients in our set. We will pare this down considerably, keeping only ingredients that appear at least 500 times, to make it more manageable. Then we will create a column containing a comma-separated string of every ingredient that appears in the recipe.
ingred_reduced = ingred_dedup.loc[ingred_dedup.freq >= 500].copy()
ingred_reduced['text'] = ingred_reduced[['recipe_id','name']].groupby(['recipe_id'])['name'].transform(lambda x: ','.join(x))
Using CountVectorizer in sklearn we can efficiently transform the dataframe into a recipe id by ingredient matrix. Each cell value contains a count of how many times an ingredient appeared in the recipe.
from sklearn.feature_extraction.text import CountVectorizer

tfidf = CountVectorizer(
    tokenizer=lambda x: x.lower().split(','),
    preprocessor=lambda x: x,
)
x = tfidf.fit_transform(ingred_reduced.text)
To turn this into an ingredient-by-ingredient co-occurrence matrix, we multiply the matrix by its transpose. We read the result back into a dataframe, using the feature names (ingredient names) as the index and columns. Finally, we perform a modified normalization of the dataframe to force all values to be positive (a condition we need in order to construct the network graph) and zero out the diagonal so the graph contains no self-loops.
import numpy as np

y = x.T * x
final = pd.DataFrame(y.todense(), columns=tfidf.get_feature_names(),
                     index=tfidf.get_feature_names())
final.head()
final_norm = abs(final - final.values.mean()) / final.values.std()
np.fill_diagonal(final_norm.values, 0)  # zero the diagonal so the graph has no self-loops
Our final dataset contains only 169 ingredients because of the ingredient reduction that we performed in the last section. Had we not performed that step, we would have thousands of ingredients to visualize in a network graph, and the centrality computations that we are interested in would be prohibitively slow on a graph that size.
I originally intended to spend the majority of my tutorial on this section, but considering the length requirements of this tutorial, I will keep my analysis brief.
NetworkX is a network analysis package that allows you to build, render, and run algorithms against network graphs. To build a graph from our final ingredient-by-ingredient matrix, we simply read in the dataframe and call a draw function.
import networkx
G = networkx.from_pandas_adjacency(final_norm)
networkx.draw_networkx(G, pos=networkx.drawing.layout.kamada_kawai_layout(G),
node_size=100, width=0.1)
As you can see from the giant smush of black, this graph is highly connected, meaning that nearly every ingredient is used alongside every other ingredient in at least one recipe. We're more interested in common groupings of ingredients, so we should simplify the graph by dropping edges whose weight falls below a certain threshold. For now, let's arbitrarily choose 10.
H = networkx.Graph()
edg = [(u, v) for (u, v, d) in G.edges(data=True) if d['weight'] > 10]
nd = [i for i, j in edg]
H.add_nodes_from(list(set(nd)))
H.add_edges_from(edg)

networkx.draw_networkx(H, pos=networkx.drawing.layout.kamada_kawai_layout(H),
                       font_size=8)
It should be no surprise that salt is at the center of this graph since it is ubiquitous. More interesting is that even when limited to 11 ingredients, we can see the divide between sweet and savory. The baking ingredients are highly interconnected and form their own community; they are joined to the savory ingredients only by salt.
The savory ingredients are more disjoint, with only salt, onion and garlic forming a triangle.
Let's perform this analysis again, but lower the threshold to see if the savory ingredients start to show a similar relationship.
H = networkx.Graph()
edg = [(u, v) for (u, v, d) in G.edges(data=True) if d['weight'] > 5]
nd = [i for i, j in edg]
H.add_nodes_from(list(set(nd)))
H.add_edges_from(edg)
from matplotlib import pyplot as plt
plt.figure(figsize=(100,100))
networkx.draw_networkx(H, pos=networkx.drawing.layout.kamada_kawai_layout(H),
font_size=100, node_size=10000)
plt.show()
The savory ingredients are starting to show the same relationship and now there are three bridges; salt, water and butter.
Remember, we are generating these graphs using a normalized ingredient-by-ingredient matrix. We can start to draw conclusions about the types of recipes included on AllRecipes: there seems to be a high concentration of baked goods as well as, dare I say, Italian recipes.
Let's zoom out one more time just for fun.
I = networkx.Graph()
edg = [(u, v) for (u, v, d) in G.edges(data=True) if d['weight'] > 3]
nd = [i for i, j in edg]
I.add_nodes_from(list(set(nd)))
I.add_edges_from(edg)
plt.figure(figsize=(100,100))
networkx.draw_networkx(I, pos=networkx.drawing.layout.kamada_kawai_layout(I),
font_size=100, node_size=10000)
plt.show()
Users on AllRecipes tend to prefer simpler recipes with fewer ingredients. Not only do users rate low-ingredient recipes higher, they are also more likely to attempt to cook them themselves. There are two clear themes for recipes: baked goods and Italian savory meals.
This tutorial provides only a cursory look at the ingredient web. Curious readers are encouraged to explore the graph further by examining measures of centrality and cliques.
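As a starting point, here is a sketch of those measures on a small toy graph; the real analysis would run them against the thresholded ingredient graph H built above.

```python
import networkx as nx

# Toy stand-in for the thresholded ingredient graph.
G = nx.Graph([('salt', 'butter'), ('salt', 'sugar'), ('salt', 'onion'),
              ('butter', 'sugar'), ('onion', 'garlic')])

# Degree centrality: each node's degree divided by the maximum possible degree.
centrality = nx.degree_centrality(G)
print(max(centrality, key=centrality.get))  # salt

# Maximal cliques: fully connected groups of ingredients.
print(sorted(len(c) for c in nx.find_cliques(G)))  # [2, 2, 3]
```

On the real graph, betweenness centrality would also be worth a look, since it would quantify how strongly salt bridges the sweet and savory communities.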